Supermarket python project
Importing all needful libraries and csv file with data
import pandas as pd
import math
import plotly.express as px
import plotly.graph_objects as go
from IPython.core.display import display, HTML
sales = pd.read_csv('supermarket_sales - Sheet1.csv', header=0)
sales
/var/folders/8q/d84brk7924n4jpj915_z1nm00000gn/T/ipykernel_21161/1561358757.py:5: DeprecationWarning: Importing display from IPython.core.display is deprecated since IPython 7.14, please import from IPython.display
from IPython.core.display import display, HTML
| Invoice ID | Branch | City | Customer type | Gender | Product line | Unit price | Quantity | Tax 5% | Total | Date | Time | Payment | cogs | gross margin percentage | gross income | Rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 750-67-8428 | A | Yangon | Member | Female | Health and beauty | 74.69 | 7 | 26.1415 | 548.9715 | 1/5/2019 | 13:08 | Ewallet | 522.83 | 4.761905 | 26.1415 | 9.1 |
| 1 | 226-31-3081 | C | Naypyitaw | Normal | Female | Electronic accessories | 15.28 | 5 | 3.8200 | 80.2200 | 3/8/2019 | 10:29 | Cash | 76.40 | 4.761905 | 3.8200 | 9.6 |
| 2 | 631-41-3108 | A | Yangon | Normal | Male | Home and lifestyle | 46.33 | 7 | 16.2155 | 340.5255 | 3/3/2019 | 13:23 | Credit card | 324.31 | 4.761905 | 16.2155 | 7.4 |
| 3 | 123-19-1176 | A | Yangon | Member | Male | Health and beauty | 58.22 | 8 | 23.2880 | 489.0480 | 1/27/2019 | 20:33 | Ewallet | 465.76 | 4.761905 | 23.2880 | 8.4 |
| 4 | 373-73-7910 | A | Yangon | Normal | Male | Sports and travel | 86.31 | 7 | 30.2085 | 634.3785 | 2/8/2019 | 10:37 | Ewallet | 604.17 | 4.761905 | 30.2085 | 5.3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 233-67-5758 | C | Naypyitaw | Normal | Male | Health and beauty | 40.35 | 1 | 2.0175 | 42.3675 | 1/29/2019 | 13:46 | Ewallet | 40.35 | 4.761905 | 2.0175 | 6.2 |
| 996 | 303-96-2227 | B | Mandalay | Normal | Female | Home and lifestyle | 97.38 | 10 | 48.6900 | 1022.4900 | 3/2/2019 | 17:16 | Ewallet | 973.80 | 4.761905 | 48.6900 | 4.4 |
| 997 | 727-02-1313 | A | Yangon | Member | Male | Food and beverages | 31.84 | 1 | 1.5920 | 33.4320 | 2/9/2019 | 13:22 | Cash | 31.84 | 4.761905 | 1.5920 | 7.7 |
| 998 | 347-56-2442 | A | Yangon | Normal | Male | Home and lifestyle | 65.82 | 1 | 3.2910 | 69.1110 | 2/22/2019 | 15:33 | Cash | 65.82 | 4.761905 | 3.2910 | 4.1 |
| 999 | 849-09-3807 | A | Yangon | Member | Female | Fashion accessories | 88.34 | 7 | 30.9190 | 649.2990 | 2/18/2019 | 13:28 | Cash | 618.38 | 4.761905 | 30.9190 | 6.6 |
1000 rows × 17 columns
Descriptive statistics
There is mean and standart deviation of different fields
sales.describe()
| Unit price | Quantity | Tax 5% | Total | cogs | gross margin percentage | gross income | Rating | |
|---|---|---|---|---|---|---|---|---|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.00000 | 1000.000000 | 1000.000000 | 1000.00000 |
| mean | 55.672130 | 5.510000 | 15.379369 | 322.966749 | 307.58738 | 4.761905 | 15.379369 | 6.97270 |
| std | 26.494628 | 2.923431 | 11.708825 | 245.885335 | 234.17651 | 0.000000 | 11.708825 | 1.71858 |
| min | 10.080000 | 1.000000 | 0.508500 | 10.678500 | 10.17000 | 4.761905 | 0.508500 | 4.00000 |
| 25% | 32.875000 | 3.000000 | 5.924875 | 124.422375 | 118.49750 | 4.761905 | 5.924875 | 5.50000 |
| 50% | 55.230000 | 5.000000 | 12.088000 | 253.848000 | 241.76000 | 4.761905 | 12.088000 | 7.00000 |
| 75% | 77.935000 | 8.000000 | 22.445250 | 471.350250 | 448.90500 | 4.761905 | 22.445250 | 8.50000 |
| max | 99.960000 | 10.000000 | 49.650000 | 1042.650000 | 993.00000 | 4.761905 | 49.650000 | 10.00000 |
Median of different columns
print(sales.Quantity.median())
print(sales.Rating.median())
print(int(sales.Total.median()))
5.0
7.0
253
Looking at columns in detail
columns_names = list(sales.columns)
columns = pd.DataFrame(columns_names, columns = ['Names of columns'])
columns
| Names of columns | |
|---|---|
| 0 | Invoice ID |
| 1 | Branch |
| 2 | City |
| 3 | Customer type |
| 4 | Gender |
| 5 | Product line |
| 6 | Unit price |
| 7 | Quantity |
| 8 | Tax 5% |
| 9 | Total |
| 10 | Date |
| 11 | Time |
| 12 | Payment |
| 13 | cogs |
| 14 | gross margin percentage |
| 15 | gross income |
| 16 | Rating |
Looking on different product lines presented in dataset
d = pd.DataFrame(set(list(sales['Product line'])))
d
| 0 | |
|---|---|
| 0 | Food and beverages |
| 1 | Fashion accessories |
| 2 | Home and lifestyle |
| 3 | Sports and travel |
| 4 | Electronic accessories |
| 5 | Health and beauty |
Data cleanup
Checking for zero and "NaN" elements
sales.count()
Invoice ID 1000
Branch 1000
City 1000
Customer type 1000
Gender 1000
Product line 1000
Unit price 1000
Quantity 1000
Tax 5% 1000
Total 1000
Date 1000
Time 1000
Payment 1000
cogs 1000
gross margin percentage 1000
gross income 1000
Rating 1000
dtype: int64
Therefore there is no such elements
Format column names
sales['Total'] = sales['Total'].apply(lambda x: round(x, 2))
sales['gross income'] = sales['gross income'].apply(lambda x: round(x, 2))
sales = sales.drop(['Invoice ID', 'Tax 5%', 'Branch', 'gross margin percentage', 'cogs'], axis = 1)
sales.rename(columns={'gross income': 'Gross income'}, inplace=True)
sales
| City | Customer type | Gender | Product line | Unit price | Quantity | Total | Date | Time | Payment | Gross income | Rating | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Yangon | Member | Female | Health and beauty | 74.69 | 7 | 548.97 | 1/5/2019 | 13:08 | Ewallet | 26.14 | 9.1 |
| 1 | Naypyitaw | Normal | Female | Electronic accessories | 15.28 | 5 | 80.22 | 3/8/2019 | 10:29 | Cash | 3.82 | 9.6 |
| 2 | Yangon | Normal | Male | Home and lifestyle | 46.33 | 7 | 340.53 | 3/3/2019 | 13:23 | Credit card | 16.22 | 7.4 |
| 3 | Yangon | Member | Male | Health and beauty | 58.22 | 8 | 489.05 | 1/27/2019 | 20:33 | Ewallet | 23.29 | 8.4 |
| 4 | Yangon | Normal | Male | Sports and travel | 86.31 | 7 | 634.38 | 2/8/2019 | 10:37 | Ewallet | 30.21 | 5.3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Naypyitaw | Normal | Male | Health and beauty | 40.35 | 1 | 42.37 | 1/29/2019 | 13:46 | Ewallet | 2.02 | 6.2 |
| 996 | Mandalay | Normal | Female | Home and lifestyle | 97.38 | 10 | 1022.49 | 3/2/2019 | 17:16 | Ewallet | 48.69 | 4.4 |
| 997 | Yangon | Member | Male | Food and beverages | 31.84 | 1 | 33.43 | 2/9/2019 | 13:22 | Cash | 1.59 | 7.7 |
| 998 | Yangon | Normal | Male | Home and lifestyle | 65.82 | 1 | 69.11 | 2/22/2019 | 15:33 | Cash | 3.29 | 4.1 |
| 999 | Yangon | Member | Female | Fashion accessories | 88.34 | 7 | 649.30 | 2/18/2019 | 13:28 | Cash | 30.92 | 6.6 |
1000 rows × 12 columns
Plots
Amount of products by each product line
dfk = sales.groupby("Product line")['Quantity'].sum()
dfk = pd.DataFrame(dfk)
dfk.reset_index(inplace=True)
dfk = dfk.sort_values('Quantity', ascending=True, ignore_index=True)
fig = px.bar(dfk, x='Product line', y='Quantity')
fig.write_html('g1.html')
display(HTML('g1.html'))
Product line comparison to unit price
fig=px.box(sales,x='Product line',color='Product line',y='Unit price')
fig.show()
Amount of total sum of purchases by product lined
df = pd.DataFrame({'Line': sales['Product line'], 'City': sales['City'], 'Total': sales['Total']})
df = pd.DataFrame(df.groupby('Line')['Total'].sum())
fig = px.bar(df.reset_index(), x='Line', y='Total')
fig.show()
Amount of total sum of purchases by city
df = pd.DataFrame({'line': sales['Product line'], 'City': sales['City'], 'Total': sales['Total']})
df = pd.DataFrame(df.groupby('City')['Total'].sum())
fig = px.bar(df.reset_index(), x='City', y='Total')
fig.show()
Rating by time of purchases
sale = pd.DataFrame({'Time': sales['Time'], 'Rating': sales['Rating'], 'Total': sales['Total']})
sale = sale.sort_values('Time')
fig = px.scatter(sale, x="Time", y="Rating")
fig.show()
Total sum of purchases by time of purchases
fig = px.scatter(sale, x="Time", y="Total")
fig.show()
Relation of payment methods and total sum of purchases
fig1 = px.pie(sales, names='Payment', values='Total')
fig1.show()
Relation between membership and total sum of purchases and rating
fig = px.pie(sales, values="Total", names="Customer type")
fig.show()
fig = px.pie(sales, values="Rating", names="Customer type")
fig.show()
Detailed overview
What product line gives more total sum of products and in what city
df = pd.DataFrame({'line': sales['Product line'], 'city': sales['City'], 'total': sales['Total']})
df = df.sort_values(by='city', ignore_index=True)
fig = px.sunburst(df, path=['city', 'line'], values='total')
fig.update_layout(margin = dict(t=1, l=1, r=1, b=1))
fig.show()
Membershiping by sex
df = pd.DataFrame({'sex': sales['Gender'], 'Customer type': sales['Customer type']})
male_member = 0
male_notmem = 0
female_member = 0
female_notmem = 0
df_sorted = df.sort_values(by='sex', ascending=False)
#count male member/not member
for i in range(501):
if df_sorted.iloc[i, 1] == 'Member':
male_member += 1
else:
male_notmem += 1
for i in range(501, 1000):
if df_sorted.iloc[i, 1] == 'Member':
female_member += 1
else:
female_notmem += 1
df1 = pd.DataFrame({'male': [male_member, male_notmem] , 'female': [female_member, female_notmem], 'membership': ['member', 'not_member']})
df_transformed = df1.melt(id_vars='membership', var_name='sex', value_name='count')
fig = px.bar(df_transformed, x='sex', y='count', color='membership', barmode='group', height=350)
fig.show()
Total sum and it dependence of date, time and city
ndf = pd.DataFrame({'date': sales['Date'], 'time': sales['Time'], 'city': sales['City'], 'total': sales['Total']})
ndf = ndf.sort_values(['date', 'time'], ignore_index=True, ascending=True)
fig = px.scatter_3d(ndf, x='date', y='time', z='total',
color='city', size_max=18,
symbol='city', opacity=0.7)
# tight layout
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()
We can note that there are interesting trendline each city
Lets look more datailed on them
Mandalay
ndf = ndf[ndf['city'] == 'Mandalay']
ndf = ndf.sort_values(['date', 'time'], ignore_index=True, ascending=True)
fig = px.scatter_3d(
ndf,
x='date',
y='time',
z='total',
color='city',
size_max=18,
symbol='city',
opacity=0.7
)
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()
fig = px.scatter(ndf, x="time", y="date")
fig.show()
ndf = ndf.sort_values(['time', 'date'], ignore_index=True, ascending=True)
fig = px.scatter(ndf, x="time", y="date")
fig.show()
Yangon
ndf = pd.DataFrame({'date': sales['Date'], 'time': sales['Time'], 'city': sales['City'], 'total': sales['Total']})
ndf = ndf[ndf['city'] == 'Yangon']
ndf = ndf.sort_values(['date', 'time'], ignore_index=True, ascending=True)
fig = px.scatter_3d(
ndf,
x='date',
y='time',
z='total',
color='city',
size_max=18,
symbol='city',
opacity=0.7
)
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()
ndf = ndf.sort_values(['date', 'time'], ignore_index=True, ascending=True)
fig = px.scatter(ndf, x="time", y="date")
fig.show()
ndf = ndf.sort_values(['time', 'date'], ignore_index=True, ascending=True)
fig = px.scatter(ndf, x="time", y="date")
fig.show()
Naypyitaw
ndf = pd.DataFrame({'date': sales['Date'], 'time': sales['Time'], 'city': sales['City'], 'total': sales['Total']})
ndf = ndf[ndf['city'] == 'Naypyitaw']
ndf = ndf.sort_values(['date', 'time'], ignore_index=True, ascending=True)
fig = px.scatter_3d(
ndf,
x='date',
y='time',
z='total',
color='city',
size_max=18,
symbol='city',
opacity=0.7
)
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()
ndf = ndf.sort_values(['date', 'time'], ignore_index=True, ascending=True)
fig = px.scatter(ndf, x="time", y="date")
fig.show()
ndf = ndf.sort_values(['time', 'date'], ignore_index=True, ascending=True)
fig = px.scatter(ndf, x="time", y="date")
fig.show()
Overall, we can notice similar graph trends for each city This may be due to different factors: working hours, growing popularity of the supermarket and etc.
Average total sum in half hours
data = []
for i in range(10, 21):
first_halfhour = generate_time_list(i, 0, i, 29)
sec_halfhour = generate_time_list(i, 30, i, 59)
total_f = total_s = cnt = cnto = 0
for j in range(1000):
if dfr['time'].iloc[j] in first_halfhour:
cnt += 1
total_f += int(dfr['total'].iloc[j])
if dfr['time'].iloc[j] in sec_halfhour:
cnto += 1
total_s += int(dfr['total'].iloc[j])
totalfi = round(total_f / cnt, 2) if cnt else 0
totalse = round(total_s / cnto, 2) if cnto else 0
data.append([f"{str(i).zfill(2)}:00 - {str(i).zfill(2)}:29", totalfi])
data.append([f"{str(i).zfill(2)}:30 - {str(i).zfill(2)}:59", totalse])
# Create a DataFrame from the data
df_result = pd.DataFrame(data, columns=['Time Range', 'Average Total'])
df_result
| Time Range | Average Total | |
|---|---|---|
| 0 | 10:00 - 10:29 | 262.82 |
| 1 | 10:30 - 10:59 | 357.47 |
| 2 | 11:00 - 11:29 | 360.95 |
| 3 | 11:30 - 11:59 | 316.12 |
| 4 | 12:00 - 12:29 | 273.05 |
| 5 | 12:30 - 12:59 | 307.42 |
| 6 | 13:00 - 13:29 | 353.33 |
| 7 | 13:30 - 13:59 | 322.05 |
| 8 | 14:00 - 14:29 | 310.98 |
| 9 | 14:30 - 14:59 | 426.67 |
| 10 | 15:00 - 15:29 | 339.83 |
| 11 | 15:30 - 15:59 | 266.19 |
| 12 | 16:00 - 16:29 | 393.54 |
| 13 | 16:30 - 16:59 | 271.79 |
| 14 | 17:00 - 17:29 | 373.14 |
| 15 | 17:30 - 17:59 | 288.92 |
| 16 | 18:00 - 18:29 | 297.94 |
| 17 | 18:30 - 18:59 | 256.90 |
| 18 | 19:00 - 19:29 | 366.04 |
| 19 | 19:30 - 19:59 | 338.32 |
| 20 | 20:00 - 20:29 | 278.70 |
| 21 | 20:30 - 20:59 | 332.16 |
Look at origin graph
fig = px.scatter(sale, x="Time", y="Total")
fig.show()
fig = px.line(df_result, x="Time Range", y="Average Total")
fig.show()
Change in the Total price of Country according to date
px.bar(sales, x='City',y='Total',color='City',animation_frame='Date',
animation_group="City", range_y=[0,1000])
Data transformation
Creating two new columns of different interesting ratios
sales['Price-rating ratio'] = sales['Total']/sales['Rating']
sales['Price-rating ratio'] = sales['Price-rating ratio'].astype('int64')
sales['Qnty/price'] = sales['Total']/sales['Quantity']
sales['Qnty/price'] = sales['Qnty/price'].apply(lambda x: round(x, 2))
sales
| City | Customer type | Gender | Product line | Unit price | Quantity | Total | Date | Time | Payment | Gross income | Rating | Price-rating ratio | Qnty/price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Yangon | Member | Female | Health and beauty | 74.69 | 7 | 548.97 | 1/5/2019 | 13:08 | Ewallet | 26.14 | 9.1 | 60 | 78.42 |
| 1 | Naypyitaw | Normal | Female | Electronic accessories | 15.28 | 5 | 80.22 | 3/8/2019 | 10:29 | Cash | 3.82 | 9.6 | 8 | 16.04 |
| 2 | Yangon | Normal | Male | Home and lifestyle | 46.33 | 7 | 340.53 | 3/3/2019 | 13:23 | Credit card | 16.22 | 7.4 | 46 | 48.65 |
| 3 | Yangon | Member | Male | Health and beauty | 58.22 | 8 | 489.05 | 1/27/2019 | 20:33 | Ewallet | 23.29 | 8.4 | 58 | 61.13 |
| 4 | Yangon | Normal | Male | Sports and travel | 86.31 | 7 | 634.38 | 2/8/2019 | 10:37 | Ewallet | 30.21 | 5.3 | 119 | 90.63 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Naypyitaw | Normal | Male | Health and beauty | 40.35 | 1 | 42.37 | 1/29/2019 | 13:46 | Ewallet | 2.02 | 6.2 | 6 | 42.37 |
| 996 | Mandalay | Normal | Female | Home and lifestyle | 97.38 | 10 | 1022.49 | 3/2/2019 | 17:16 | Ewallet | 48.69 | 4.4 | 232 | 102.25 |
| 997 | Yangon | Member | Male | Food and beverages | 31.84 | 1 | 33.43 | 2/9/2019 | 13:22 | Cash | 1.59 | 7.7 | 4 | 33.43 |
| 998 | Yangon | Normal | Male | Home and lifestyle | 65.82 | 1 | 69.11 | 2/22/2019 | 15:33 | Cash | 3.29 | 4.1 | 16 | 69.11 |
| 999 | Yangon | Member | Female | Fashion accessories | 88.34 | 7 | 649.30 | 2/18/2019 | 13:28 | Cash | 30.92 | 6.6 | 98 | 92.76 |
1000 rows × 14 columns
Two example graph using new columns
fig = px.pie(sales, values='Price-rating ratio', names='City')
fig.show()
fig = px.bar(sales, x='Qnty/price', y='Product line')
fig.show()
Hypothesis
City and customer type influence total
grouped_data = sales.groupby(['City', 'Customer type', 'Product line'])['Total'].sum().reset_index()
# Create a faceted bar chart
fig1 = px.bar(grouped_data, x='Product line', y='Total', color='Customer type',
facet_col='City')
fig1.update_layout(height=400, width=1000)
fig1.show()
In each city, normal customers consistently contribute more to total sales across all product lines, with significant variation in sales by product type.
Higher Unit Prices Lead to Lower Quantities Sold
fig1 = px.scatter(sales, x='Quantity', y='Unit price', trendline='lowess')
fig1.show()
The flat trendline indicates no clear relationship between unit price and quantity sold, suggesting that price changes do not significantly impact sales quantity.
Naypyitaw has more total sales per each customer type
city_customer_sales = sales.groupby(['City', 'Customer type'])['Total'].sum().reset_index()
# Create a bar chart
fig1 = px.bar(city_customer_sales, x='City', y='Total', color='Customer type')
fig1.show()
Naypyitaw has the highest total sales for both member and normal customer types compared to Mandalay and Yangon
Rating is not infuenced by gross income, total and payment method
fig = px.scatter(sale, x="Total", y="Rating", trendline='lowess')
fig.show()
The trendline is nearly flat, suggesting that the total sales amount does not significantly affect customer ratings.
fig3 = px.scatter(sales, x='Gross income', y='Rating', trendline='lowess')
fig3.show()
Similar to the previous plot, the trendline remains flat, indicating that gross income has little to no impact on customer ratings.
fig4 = px.box(sales, x='Payment', y='Rating')
fig4.show()
The ratings are similar across eWallet, cash, and credit card payment methods, with overlapping interquartile ranges.
The overall hypothesis that rating is not influenced by gross income, total sales, or payment method is supported by the data. Each graph consistently shows no significant correlation between these variables and customer ratings.